Enhancing Urban Tree Planting through Air Quality Analysis
Use case scenario
As a: Resident of Melbourne who values sustainable city living and is concerned about air pollution and its impact on daily health and wellbeing.
I want to: See key roadside areas in the city selected for tree planting projects based on their potential to reduce airborne pollutants.
So that I can: Breathe cleaner air in urban spaces, benefiting from reduced exposure to traffic-related emissions and a healthier living environment.
By:
Analysing microclimate sensor data to identify areas along Melbourne’s roadways with poor air quality and high exposure to vehicular emissions.
Mapping these pollution-prone zones against existing tree canopy data to locate streets with minimal natural filtration and shade coverage.
Integrating tree planting zone data to pinpoint feasible locations for vegetation expansion that align with municipal greening regulations and infrastructure constraints.
Selecting tree species with high pollutant absorption capabilities and modelling their expected impact on local air quality over time.
Applying geospatial analysis to overlay environmental indicators—such as pollutant levels, canopy gaps, and permissible planting areas—to inform strategic decision-making.
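The final overlay step can be sketched with a geopandas spatial join. This is a minimal, hedged illustration on made-up coordinates: the PM2.5 threshold, the zone geometry, and all column names are invented for the example, not taken from the datasets analysed below.

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Hypothetical sensor readings (points) carrying a pollutant measure.
sensors = gpd.GeoDataFrame(
    {"pm25": [18.0, 4.0]},
    geometry=[Point(144.96, -37.81), Point(144.99, -37.79)],
    crs="EPSG:4326",
)

# Hypothetical permissible planting zone (polygon).
zones = gpd.GeoDataFrame(
    {"zone_id": ["Z1"]},
    geometry=[Polygon([(144.95, -37.82), (144.97, -37.82),
                       (144.97, -37.80), (144.95, -37.80)])],
    crs="EPSG:4326",
)

# Spatial join: keep only sensors that fall inside a planting zone,
# then flag those exceeding an illustrative PM2.5 threshold.
joined = gpd.sjoin(sensors, zones, how="inner", predicate="within")
priority = joined[joined["pm25"] > 10]
print(priority[["pm25", "zone_id"]])
```

In practice the points would come from the microclimate sensors and the polygons from the tree planting zone dataset, with the threshold set by an air-quality guideline rather than picked arbitrarily.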
What this use case will teach you
This use case explores how to integrate diverse urban datasets—specifically microclimate sensor readings, tree canopy coverage, and planting zone data—to identify roadside locations where strategic tree planting can significantly improve air quality. By engaging with real-world environmental data, you will address pressing urban health challenges related to pollution exposure.
You will gain practical experience in applying geospatial analysis and leveraging microclimate insights to support data-informed decisions in sustainable urban planning.
The case study focuses on techniques for assessing how vegetation coverage, pollutant concentrations, and permissible planting zones intersect, equipping you with the tools to prioritise greening interventions that deliver measurable air quality benefits.
A key learning outcome will be the development of user-friendly dashboards and interactive visualisations to effectively communicate findings to stakeholders such as urban planners, environmental agencies, and community members.
You will also learn to evaluate the environmental and societal implications of targeted greening projects, enabling you to propose data-driven strategies for creating healthier, more breathable urban environments.
Project Goals and expected outcomes
This project showcases the capability to integrate and analyse multiple open datasets—specifically microclimate sensor data, existing tree canopy coverage, and designated planting zones—to identify roadside locations where new tree plantings can most effectively improve urban air quality.
The analysis will involve spatial and environmental evaluations to prioritise streets and road corridors that currently experience high levels of air pollution and lack sufficient vegetation. These assessments will guide decisions around optimal placement and species selection to maximise air filtration benefits.
The goal is to deliver actionable insights that contribute to cleaner urban air by enhancing tree coverage in strategically selected areas. The project supports Melbourne’s broader objectives for sustainability, public health, and environmental resilience.
A central output will be the development of an interactive, user-focused dashboard that visualises pollutant levels, canopy gaps, and suitable planting zones. This tool will empower urban planners, environmental stakeholders, and policymakers to make data-informed decisions.
Ultimately, the project aims to provide clear, evidence-based recommendations for improving air quality through targeted urban greening—enhancing both the environmental performance of city streets and the everyday wellbeing of Melbourne’s residents.
Initialisation
Importing the necessary libraries
import warnings
warnings.filterwarnings("ignore")
import pandas as pd                        # Data manipulation and analysis
import numpy as np                         # Numerical operations
from sklearn.impute import SimpleImputer   # For handling missing values
import matplotlib.pyplot as plt            # Basic plotting
import plotly.express as px                # Interactive plotting
import contextily as ctx                   # Geospatial basemap plotting
import seaborn as sns                      # Statistical data visualisation
import geopandas as gpd                    # Handling shapefiles, GeoJSON, etc.
import folium                              # Interactive mapping
import requests                            # Accessing datasets over HTTP
from io import StringIO                    # Reading CSV text held in memory
from collections import defaultdict        # Dictionaries with default values
import json                                # Parsing and reading JSON
from IPython.display import IFrame         # Displaying saved visualisations
from shapely.geometry import shape, Point  # Creating and manipulating geometric objects
%matplotlib inline
Importing the data via the API from the City of Melbourne open data portal
The function below accesses open datasets via their API endpoints, retrieving each dataset in CSV format for in-depth analysis. Given a dataset identifier, it issues a request to the Melbourne open data portal's export endpoint and parses the response into a DataFrame. This streamlines the incorporation of diverse datasets—such as microclimate sensors, urban tree canopies, and tree planting zones—facilitating straightforward access and efficient data integration for urban planning research.
def import_data(dataset_id):
    """
    Imports a dataset from the City of Melbourne Open Data API.
    Parameters:
    - dataset_id (str): The unique dataset identifier.
    Returns:
    - pd.DataFrame: The imported dataset as a pandas DataFrame.
    """
    base_url = 'https://data.melbourne.vic.gov.au/api/explore/v2.1/catalog/datasets/'
    export_format = 'csv'
    url = f'{base_url}{dataset_id}/exports/{export_format}'
    params = {
        'select': '*',
        'limit': -1,  # all records
        'lang': 'en',
        'timezone': 'UTC'
    }
    # GET request
    response = requests.get(url, params=params)
    if response.status_code == 200:
        # StringIO to read the CSV data
        url_content = response.content.decode('utf-8')
        df = pd.read_csv(StringIO(url_content), delimiter=';')
        print(f'Imported the {dataset_id} dataset with {len(df)} records successfully\n')
        return df
    else:
        print(f'Request failed with status code {response.status_code}')
        return None
Datasets
The following section illustrates how the data is retrieved and prepared for analysis. The import_data function is used to access three key datasets: tree canopies, microclimate sensors, and tree planting zones. Each dataset is stored in a separate pandas DataFrame — df_tree_canopy_2021, df_microclimate, and df_tree_planting. The number of records retrieved from each dataset is displayed, providing a clear overview of the data volume available for exploration. This structured approach ensures that the datasets are properly organised and readily available for integration into the project's analytical workflows.
Importing dataset - tree-canopies-2021-urban-forest
About the dataset: The Tree Canopies 2021 - Urban Forest dataset maps the extent of tree canopy cover across the City of Melbourne using aerial imagery and LiDAR data. It provides detailed spatial insights into urban forest coverage, supporting initiatives in climate resilience, biodiversity, and urban planning.
tree_canopy_2021 = 'tree-canopies-2021-urban-forest'
df_tree_canopy_2021 = import_data(tree_canopy_2021)
df_tree_canopy_2021_orig = df_tree_canopy_2021 #saving the original dataset
print('First few rows of the dataset:\n',df_tree_canopy_2021.head(5))
Imported the tree-canopies-2021-urban-forest dataset with 57980 records successfully
First few rows of the dataset:
geo_point_2d \
0 -37.77506304683423, 144.93898465421296
1 -37.775132956993566, 144.93979253397976
2 -37.77515941768105, 144.93986878938023
3 -37.775130649990984, 144.9389679196713
4 -37.77518700935825, 144.93875479828375
geo_shape
0 {"coordinates": [[[[144.9389624164712, -37.775...
1 {"coordinates": [[[[144.93978541786475, -37.77...
2 {"coordinates": [[[[144.93986368038625, -37.77...
3 {"coordinates": [[[[144.93894119058956, -37.77...
4 {"coordinates": [[[[144.93877858274755, -37.77...
Importing dataset - microclimate-sensors-data
About the dataset: The Microclimate Sensors Data dataset captures real-time environmental conditions across Melbourne using a network of sensors. It includes measurements such as air temperature, humidity, and solar radiation, supporting research in urban heat islands, climate monitoring, and sustainable city planning.
microclimate = 'microclimate-sensors-data'
df_microclimate = import_data(microclimate)
df_microclimate_orig = df_microclimate #saving the original dataset
print('First few rows of the dataset:\n',df_microclimate.head(5))
Imported the microclimate-sensors-data dataset with 292472 records successfully
First few rows of the dataset:
device_id received_at \
0 ICTMicroclimate-08 2025-02-09T00:54:37+00:00
1 ICTMicroclimate-11 2025-02-09T01:02:11+00:00
2 ICTMicroclimate-05 2025-02-09T01:03:24+00:00
3 ICTMicroclimate-01 2025-02-09T01:02:43+00:00
4 ICTMicroclimate-09 2025-02-09T01:17:37+00:00
sensorlocation \
0 Swanston St - Tram Stop 13 adjacent Federation...
1 1 Treasury Place
2 Enterprize Park - Pole ID: COM1667
3 Birrarung Marr Park - Pole 1131
4 SkyFarm (Jeff's Shed). Rooftop - Melbourne Con...
latlong minimumwinddirection averagewinddirection \
0 -37.8184515, 144.9678474 0.0 153.0
1 -37.812888, 144.9750857 0.0 144.0
2 -37.8204083, 144.9591192 0.0 45.0
3 -37.8185931, 144.9716404 NaN 150.0
4 -37.8223306, 144.9521696 0.0 241.0
maximumwinddirection minimumwindspeed averagewindspeed gustwindspeed \
0 358.0 0.0 3.9 7.9
1 356.0 0.0 2.0 7.8
2 133.0 0.0 1.5 2.7
3 NaN NaN 1.6 NaN
4 359.0 0.0 0.9 4.4
airtemperature relativehumidity atmosphericpressure pm25 pm10 \
0 23.9 57.300000 1009.7 0.0 0.0
1 24.5 56.200000 1005.3 0.0 0.0
2 25.0 60.000000 1009.6 1.0 3.0
3 23.1 61.099998 1009.0 0.0 5.0
4 25.6 53.700000 1007.9 0.0 0.0
noise
0 80.500000
1 62.900000
2 68.500000
3 51.700001
4 60.200000
Importing dataset - tree-planting-zone-schedules-with-years-urban-forest
About the dataset: The Tree Planting Zone Schedules with Years - Urban Forest dataset outlines planned and completed tree planting activities across various city zones from 2017 onward. It provides scheduling information to support long-term urban greening strategies and enhance canopy coverage in the City of Melbourne.
tree_planting = 'tree-planting-zone-schedules-with-years-urban-forest'
df_tree_planting = import_data(tree_planting)
df_tree_planting_orig = df_tree_planting #saving the original dataset
print('First few rows of the dataset:\n',df_tree_planting.head(5))
Imported the tree-planting-zone-schedules-with-years-urban-forest dataset with 839 records successfully
First few rows of the dataset:
geo_point_2d \
0 -37.8030612625994, 144.96879535330316
1 -37.81979235869452, 144.9675112730786
2 -37.796602429685905, 144.96990674715127
3 -37.79774465832566, 144.9502620746376
4 -37.81652483951976, 144.9864951838192
geo_shape str_from segpart \
0 {"coordinates": [[[[144.969387175266, -37.8029... NaN NaN
1 {"coordinates": [[[[144.96828098035, -37.81969... NaN NaN
2 {"coordinates": [[[[144.969341164027, -37.7965... NaN NaN
3 {"coordinates": [[[[144.950283591209, -37.7975... NaN NaN
4 {"coordinates": [[[[144.986101797552, -37.8163... NaN NaN
statusid segid streetname streetid schedule \
0 NaN 21556 NaN NaN Not determined by precinct plan
1 NaN 22067 NaN NaN Years 8 - 10
2 NaN 20697 NaN NaN Years 5 - 7
3 NaN 21195 NaN NaN Complete
4 NaN 21945 NaN NaN Not determined by precinct plan
mapstatus str_to segdescr
0 NaN NaN Pelham Street between Rathdowne Street and Dru...
1 NaN NaN Riverside Avenue between St Kilda Road and Sou...
2 NaN NaN Little Palmerston Street between Rathdowne Str...
3 NaN NaN Chapman Street between Errol Street and Harker...
4 NaN NaN Wellington Parade between Simpson Street and P...
Data Cleansing and Preprocessing
The Data Cleansing and Preprocessing phase focuses on preparing the tree canopies, microclimate sensors, and tree planting zones datasets for analysis. This involves resolving inconsistencies, handling missing entries, and reformatting data as needed—such as separating latitude and longitude fields, removing redundant columns, and ensuring appropriate structure across datasets. These steps are critical to harmonize the datasets for seamless integration and analysis. By standardizing and validating the data, this process enhances the accuracy and reliability of any insights derived.
def split_geo_coordinates(df, geo_column):
    """
    Splits a combined latitude,longitude column into two separate float columns: 'latitude' and 'longitude'.
    Parameters:
    - df (pd.DataFrame): The input DataFrame containing the geo column.
    - geo_column (str): The name of the column with 'latitude,longitude' string values.
    Returns:
    - pd.DataFrame: A new DataFrame with separate 'latitude' and 'longitude' columns.
    """
    if geo_column not in df.columns:
        raise ValueError(f"Column '{geo_column}' not found in DataFrame.")
    try:
        # Ensure the geo_column is of string type
        df[geo_column] = df[geo_column].astype(str)
        # Attempt to split the column
        split_data = df[geo_column].str.split(',', expand=True)
        if split_data.shape[1] != 2:
            raise ValueError(f"Column '{geo_column}' does not contain valid 'latitude,longitude' format.")
        df['latitude'] = pd.to_numeric(split_data[0], errors='coerce')
        df['longitude'] = pd.to_numeric(split_data[1], errors='coerce')
        # Drop rows with invalid coordinates
        df.dropna(subset=['latitude', 'longitude'], inplace=True)
        # Drop the original geo column
        df = df.drop(columns=[geo_column])
        print('Dataset Info after Geo Split:\n')
        print(df.info())
    except Exception as e:
        print(f"An error occurred during geolocation splitting: {e}")
        raise
    return df
def check_preprocess_dataset(df_dataset, dataset_name='dataset'):
    """
    Inspects and preprocesses a dataset:
    - Prints dataset info
    - Checks for missing values
    - Removes duplicate rows (if any)
    Parameters:
    - df_dataset (pd.DataFrame): The input DataFrame to be checked and cleaned.
    - dataset_name (str): Optional name of the dataset for logging purposes.
    Returns:
    - pd.DataFrame: A cleaned version of the input DataFrame.
    """
    try:
        if not isinstance(df_dataset, pd.DataFrame):
            raise TypeError("Input is not a pandas DataFrame.")
        print(f'Dataset Information for "{dataset_name}":\n')
        print(df_dataset.info())
        # Check for missing values
        print(f'\nMissing values in "{dataset_name}" dataset:\n')
        print(df_dataset.isnull().sum())
        # Identify and remove duplicates
        dupes = df_dataset.duplicated().sum()
        if dupes > 0:
            df_dataset = df_dataset.drop_duplicates()
            print(f'\nDeleted {dupes} duplicate record(s) from "{dataset_name}".')
        else:
            print(f'\nNo duplicate records found in "{dataset_name}".')
    except Exception as e:
        print(f"An error occurred while preprocessing '{dataset_name}': {e}")
        raise
    return df_dataset
Tree Canopies 2021 dataset
df_tree_canopy_2021 = check_preprocess_dataset(df_tree_canopy_2021, 'Tree Canopies 2021')
Dataset Information for "Tree Canopies 2021":

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 57980 entries, 0 to 57979
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   geo_point_2d  57980 non-null  object
 1   geo_shape     57980 non-null  object
dtypes: object(2)
memory usage: 906.1+ KB
None

Missing values in "Tree Canopies 2021" dataset:

geo_point_2d    0
geo_shape       0
dtype: int64

No duplicate records found in "Tree Canopies 2021".
To facilitate spatial analysis, the geo_point_2d column was split into separate latitude and longitude columns. These new columns were then converted into numeric formats to allow for further computations and visualizations. Finally, the original geo_point_2d column was dropped to avoid redundancy, leaving a clean and structured dataset ready for spatial analysis and modeling.
#splitting geo coordinates
df_tree_canopy_2021 = split_geo_coordinates(df_tree_canopy_2021,'geo_point_2d')
print('First few rows of the dataset after preprocessing:\n',df_tree_canopy_2021.head(5))
Dataset Info after Geo Split:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 57980 entries, 0 to 57979
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 geo_shape 57980 non-null object
1 latitude 57980 non-null float64
2 longitude 57980 non-null float64
dtypes: float64(2), object(1)
memory usage: 1.3+ MB
None
First few rows of the dataset after preprocessing:
geo_shape latitude longitude
0 {"coordinates": [[[[144.9389624164712, -37.775... -37.775063 144.938985
1 {"coordinates": [[[[144.93978541786475, -37.77... -37.775133 144.939793
2 {"coordinates": [[[[144.93986368038625, -37.77... -37.775159 144.939869
3 {"coordinates": [[[[144.93894119058956, -37.77... -37.775131 144.938968
4 {"coordinates": [[[[144.93877858274755, -37.77... -37.775187 144.938755
Microclimate Sensor dataset
df_microclimate = check_preprocess_dataset(df_microclimate, 'Microclimate Sensor')
Dataset Information for "Microclimate Sensor":

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 292472 entries, 0 to 292471
Data columns (total 16 columns):
 #   Column                Non-Null Count   Dtype
---  ------                --------------   -----
 0   device_id             292472 non-null  object
 1   received_at           292472 non-null  object
 2   sensorlocation        286329 non-null  object
 3   latlong               280989 non-null  object
 4   minimumwinddirection  262620 non-null  float64
 5   averagewinddirection  292002 non-null  float64
 6   maximumwinddirection  262462 non-null  float64
 7   minimumwindspeed      262462 non-null  float64
 8   averagewindspeed      292002 non-null  float64
 9   gustwindspeed         262462 non-null  float64
 10  airtemperature        292002 non-null  float64
 11  relativehumidity      292002 non-null  float64
 12  atmosphericpressure   292002 non-null  float64
 13  pm25                  279254 non-null  float64
 14  pm10                  279254 non-null  float64
 15  noise                 279254 non-null  float64
dtypes: float64(12), object(4)
memory usage: 35.7+ MB
None

Missing values in "Microclimate Sensor" dataset:

device_id                   0
received_at                 0
sensorlocation           6143
latlong                 11483
minimumwinddirection    29852
averagewinddirection      470
maximumwinddirection    30010
minimumwindspeed        30010
averagewindspeed          470
gustwindspeed           30010
airtemperature            470
relativehumidity          470
atmosphericpressure       470
pm25                    13218
pm10                    13218
noise                   13218
dtype: int64

No duplicate records found in "Microclimate Sensor".
#selecting relevant columns
df_microclimate = df_microclimate[['device_id', 'received_at',
                                   'sensorlocation', 'latlong', 'airtemperature',
                                   'relativehumidity', 'pm25', 'pm10', 'noise']]
print("\nMissing Values:")
print(df_microclimate.isna().sum())
Missing Values:
device_id               0
received_at             0
sensorlocation       6143
latlong             11483
airtemperature        470
relativehumidity      470
pm25                13218
pm10                13218
noise               13218
dtype: int64
Handling Missing Values in sensorlocation and latlong Columns:
To ensure the reliability and integrity of the microclimate sensor dataset, we continue the preprocessing by addressing missing values in two key columns: sensorlocation and latlong.
Given that the device_id field contains no missing values, it serves as a stable identifier for filling the missing values in sensorlocation and latlong.
The approach is to first validate that each device has a consistent sensorlocation and latlong throughout the dataset.
def find_unique_counts(df, pivot_col, check_col):
    """
    Computes the number of unique values in `check_col` for each unique value in `pivot_col`.
    Parameters:
    - df (pd.DataFrame): Input DataFrame containing the columns to be grouped and counted.
    - pivot_col (str): Column name to group by.
    - check_col (str): Column name for which unique counts are calculated.
    """
    try:
        # Validate input types
        if not isinstance(df, pd.DataFrame):
            raise TypeError("Input `df` must be a pandas DataFrame.")
        if pivot_col not in df.columns:
            raise ValueError(f"'{pivot_col}' column not found in DataFrame.")
        if check_col not in df.columns:
            raise ValueError(f"'{check_col}' column not found in DataFrame.")
        # Perform groupby and unique count
        sensorloc_counts = df.groupby(pivot_col)[check_col].nunique().reset_index()
        count_col = f'unique_{check_col}_count'
        sensorloc_counts.columns = [pivot_col, count_col]
        print(f'\nUnique "{check_col}" count per "{pivot_col}":')
        print(sensorloc_counts)
    except Exception as e:
        print(f"An error occurred while calculating unique counts: {e}")
        raise
find_unique_counts(df_microclimate,'device_id','sensorlocation')
Unique "sensorlocation" count per "device_id":
device_id unique_sensorlocation_count
0 ICTMicroclimate-01 1
1 ICTMicroclimate-02 1
2 ICTMicroclimate-03 1
3 ICTMicroclimate-04 1
4 ICTMicroclimate-05 1
5 ICTMicroclimate-06 1
6 ICTMicroclimate-07 1
7 ICTMicroclimate-08 1
8 ICTMicroclimate-09 1
9 ICTMicroclimate-10 1
10 ICTMicroclimate-11 1
11 aws5-0999 1
find_unique_counts(df_microclimate,'device_id','latlong')
Unique "latlong" count per "device_id":
device_id unique_latlong_count
0 ICTMicroclimate-01 1
1 ICTMicroclimate-02 1
2 ICTMicroclimate-03 1
3 ICTMicroclimate-04 1
4 ICTMicroclimate-05 1
5 ICTMicroclimate-06 1
6 ICTMicroclimate-07 1
7 ICTMicroclimate-08 1
8 ICTMicroclimate-09 1
9 ICTMicroclimate-10 1
10 ICTMicroclimate-11 1
11 aws5-0999 1
The checks above show that each device has a single consistent sensorlocation and latlong. Hence, for each unique device_id, we can determine the unique value for both sensorlocation and latlong.
Approach:
Step 1: Group by device_id and extract unique values for sensorlocation and latlong. This helps identify the expected value associated with each device.
Step 2: For each missing sensorlocation, impute the value using the known unique mapping between device_id and sensorlocation.
Step 3: Repeat a similar process for latlong, filling in the missing values using the corresponding known latlong values for each device_id.
The imputation is performed independently for each of the two columns to maintain clarity and avoid propagation of errors.
This method ensures consistency by maintaining the integrity of device-specific metadata, which is critical for further analysis.
print(df_microclimate[['sensorlocation', 'latlong']].isna().sum())
sensorlocation     6143
latlong           11483
dtype: int64
def fill_missing_values_using_device_id(df, pivot_col, fill_col):
    """
    Fills missing values in `fill_col` based on a unique mapping from `pivot_col`,
    for devices that have exactly one known value in `fill_col`.
    Parameters:
    - df (pd.DataFrame): Input DataFrame
    - pivot_col (str): Column containing the device identifier
    - fill_col (str): Column to be filled based on the unique device mapping
    Returns:
    - pd.DataFrame: DataFrame with missing values in `fill_col` filled where possible
    """
    try:
        if not isinstance(df, pd.DataFrame):
            raise TypeError("Input `df` must be a pandas DataFrame.")
        if pivot_col not in df.columns:
            raise ValueError(f"'{pivot_col}' column not found in DataFrame.")
        if fill_col not in df.columns:
            raise ValueError(f"'{fill_col}' column not found in DataFrame.")
        # Subset rows where both columns are not missing
        valid_rows = df.dropna(subset=[pivot_col, fill_col])
        # Count unique fill_col values per device
        unique_counts = valid_rows.groupby(pivot_col)[fill_col].nunique()
        # Filter devices with only 1 unique fill_col value
        single_value_devices = unique_counts[unique_counts == 1].index
        # Create mapping from device_id to that one known value
        mapping = (
            valid_rows[valid_rows[pivot_col].isin(single_value_devices)]
            .groupby(pivot_col)[fill_col]
            .first()
        )
        # Fill missing values using map
        df[fill_col] = df[fill_col].fillna(df[pivot_col].map(mapping))
        print(f'Missing values in "{fill_col}" filled using "{pivot_col}" where uniquely possible.')
    except Exception as e:
        print(f"An error occurred while filling missing values: {e}")
        raise
    return df
df_microclimate = fill_missing_values_using_device_id(df_microclimate,'device_id', 'sensorlocation')
df_microclimate = fill_missing_values_using_device_id(df_microclimate,'device_id', 'latlong')
Missing values in "sensorlocation" filled using "device_id" where uniquely possible.
Missing values in "latlong" filled using "device_id" where uniquely possible.
print(df_microclimate[['sensorlocation', 'latlong']].isna().sum())
sensorlocation    0
latlong           0
dtype: int64
The received_at column is converted to a proper datetime object to enable time-series analysis.
# Convert 'received_at' column to datetime format
df_microclimate['received_at'] = pd.to_datetime(df_microclimate['received_at'], errors='coerce')
print(df_microclimate['received_at'].head())
print(df_microclimate['received_at'].dtypes)
0   2025-02-09 00:54:37+00:00
1   2025-02-09 01:02:11+00:00
2   2025-02-09 01:03:24+00:00
3   2025-02-09 01:02:43+00:00
4   2025-02-09 01:17:37+00:00
Name: received_at, dtype: datetime64[ns, UTC]
datetime64[ns, UTC]
To address the missing values in the remaining numeric columns (airtemperature, relativehumidity, pm25, pm10, and noise), the dataset is first sorted by device_id and received_at to ensure chronological order for each sensor. Missing values in the numeric columns are then filled using linear interpolation, which estimates each gap from the neighbouring readings within the same device group.
# Sort the DataFrame by device_id and received_at for time series continuity
df_microclimate.sort_values(by=['device_id', 'received_at'], inplace=True)
# Select numeric columns
numeric_cols = df_microclimate.select_dtypes(include=[np.number]).columns
# Apply linear interpolation within each device_id group
df_microclimate[numeric_cols] = (
    df_microclimate.groupby('device_id')[numeric_cols]
    .apply(lambda group: group.interpolate(method='linear', limit_direction='both'))
    .reset_index(drop=True)
)
# Check if missing values remain
print(df_microclimate[numeric_cols].isna().sum())
airtemperature          0
relativehumidity        0
pm25                12748
pm10                12748
noise               12748
dtype: int64
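The counts above remain presumably because interpolation has nothing to work from when a device's pm25, pm10, and noise series contain no readings at all: even with limit_direction='both', an all-NaN group stays NaN. One possible fallback is a global median fill using the SimpleImputer imported at the start. The sketch below runs on toy data; whether a constant fill is appropriate for these sensors is a modelling decision, not something the source prescribes.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame mimicking a device whose pm25 readings are entirely missing,
# so per-device interpolation could not fill them.
df = pd.DataFrame({
    "pm25": [np.nan, np.nan, np.nan, 2.0, 4.0, 6.0],
    "pm10": [1.0, np.nan, 3.0, 5.0, 7.0, 9.0],
})

# Median imputation across all rows as a fallback for the leftover gaps.
imputer = SimpleImputer(strategy="median")
df[["pm25", "pm10"]] = imputer.fit_transform(df[["pm25", "pm10"]])
print(df.isna().sum().sum())  # 0 missing values remain
```

An alternative would be to drop the affected rows, at the cost of losing the other readings those rows carry.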
To facilitate spatial analysis, the latlong column was split into separate latitude and longitude columns. These new columns were then converted into numeric formats to allow for further computations and visualizations. Finally, the original latlong column was dropped to avoid redundancy, leaving a clean and structured dataset ready for spatial analysis and modeling.
#splitting geo coordinates
df_microclimate = split_geo_coordinates(df_microclimate,'latlong')
print('First few rows of the dataset after preprocessing:\n',df_microclimate.head(5))
Dataset Info after Geo Split:
<class 'pandas.core.frame.DataFrame'>
Index: 292472 entries, 93241 to 292466
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 device_id 292472 non-null object
1 received_at 292472 non-null datetime64[ns, UTC]
2 sensorlocation 292472 non-null object
3 airtemperature 292472 non-null float64
4 relativehumidity 292472 non-null float64
5 pm25 279724 non-null float64
6 pm10 279724 non-null float64
7 noise 279724 non-null float64
8 latitude 292472 non-null float64
9 longitude 292472 non-null float64
dtypes: datetime64[ns, UTC](1), float64(7), object(2)
memory usage: 24.5+ MB
None
First few rows of the dataset after preprocessing:
device_id received_at \
93241 ICTMicroclimate-01 2024-05-29 03:57:29+00:00
26101 ICTMicroclimate-01 2024-05-29 04:12:30+00:00
42960 ICTMicroclimate-01 2024-05-29 04:27:30+00:00
42972 ICTMicroclimate-01 2024-05-29 04:42:31+00:00
93244 ICTMicroclimate-01 2024-05-29 04:57:31+00:00
sensorlocation airtemperature relativehumidity \
93241 Birrarung Marr Park - Pole 1131 18.5 38.800000
26101 Birrarung Marr Park - Pole 1131 24.0 57.599998
42960 Birrarung Marr Park - Pole 1131 21.2 61.700000
42972 Birrarung Marr Park - Pole 1131 19.4 68.400000
93244 Birrarung Marr Park - Pole 1131 18.9 36.700000
pm25 pm10 noise latitude longitude
93241 2.0 3.0 66.300000 -37.818593 144.97164
26101 4.0 9.0 58.400002 -37.818593 144.97164
42960 8.0 11.0 69.400000 -37.818593 144.97164
42972 4.0 6.0 69.800000 -37.818593 144.97164
93244 2.0 3.0 62.300000 -37.818593 144.97164
Tree Planting Zone dataset
df_tree_planting = check_preprocess_dataset(df_tree_planting, 'Tree Planting Zone')
Dataset Information for "Tree Planting Zone":

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 839 entries, 0 to 838
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   geo_point_2d  839 non-null    object
 1   geo_shape     839 non-null    object
 2   str_from      0 non-null      float64
 3   segpart       0 non-null      float64
 4   statusid      0 non-null      float64
 5   segid         839 non-null    int64
 6   streetname    0 non-null      float64
 7   streetid      0 non-null      float64
 8   schedule      839 non-null    object
 9   mapstatus     0 non-null      float64
 10  str_to        0 non-null      float64
 11  segdescr      839 non-null    object
dtypes: float64(7), int64(1), object(4)
memory usage: 78.8+ KB
None

Missing values in "Tree Planting Zone" dataset:

geo_point_2d      0
geo_shape         0
str_from        839
segpart         839
statusid        839
segid             0
streetname      839
streetid        839
schedule          0
mapstatus       839
str_to          839
segdescr          0
dtype: int64

No duplicate records found in "Tree Planting Zone".
#selecting relevant columns
df_tree_planting = df_tree_planting[['geo_point_2d', 'geo_shape', 'segid', 'schedule','segdescr']]
As with the tree canopy dataset, the geo_point_2d column was split into separate numeric latitude and longitude columns and the original column dropped, leaving a clean, structured dataset ready for spatial analysis and modelling.
df_tree_planting = split_geo_coordinates(df_tree_planting,'geo_point_2d')
print('First few rows of the dataset after preprocessing:\n',df_tree_planting.head(5))
Dataset Info after Geo Split:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 839 entries, 0 to 838
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 geo_shape 839 non-null object
1 segid 839 non-null int64
2 schedule 839 non-null object
3 segdescr 839 non-null object
4 latitude 839 non-null float64
5 longitude 839 non-null float64
dtypes: float64(2), int64(1), object(3)
memory usage: 39.5+ KB
None
First few rows of the dataset after preprocessing:
geo_shape segid \
0 {"coordinates": [[[[144.969387175266, -37.8029... 21556
1 {"coordinates": [[[[144.96828098035, -37.81969... 22067
2 {"coordinates": [[[[144.969341164027, -37.7965... 20697
3 {"coordinates": [[[[144.950283591209, -37.7975... 21195
4 {"coordinates": [[[[144.986101797552, -37.8163... 21945
schedule \
0 Not determined by precinct plan
1 Years 8 - 10
2 Years 5 - 7
3 Complete
4 Not determined by precinct plan
segdescr latitude longitude
0 Pelham Street between Rathdowne Street and Dru... -37.803061 144.968795
1 Riverside Avenue between St Kilda Road and Sou... -37.819792 144.967511
2 Little Palmerston Street between Rathdowne Str... -37.796602 144.969907
3 Chapman Street between Errol Street and Harker... -37.797745 144.950262
4 Wellington Parade between Simpson Street and P... -37.816525 144.986495
Data Analysis and Visualisation
Tree Canopies 2021 dataset
The visualisation below illustrates the spatial extent of tree canopy coverage across Melbourne by converting the polygon data into geospatial geometries. The processed shapes are plotted on a static map to highlight zones with dense vegetation.
# convert Tree Canopies dataset into a GeoDataFrame
df_tree_canopy_2021['geometry'] = df_tree_canopy_2021['geo_shape'].apply(lambda x: shape(json.loads(x)))
# Create GeoDataFrame
gdf = gpd.GeoDataFrame(df_tree_canopy_2021, geometry='geometry', crs='EPSG:4326')
# Plot static map
fig, ax = plt.subplots(figsize=(12, 10))
gdf.plot(ax=ax, color='green', edgecolor='darkgreen', alpha=1, label='Canopies')
ax.set_title('Tree Canopy Coverage (2021) - Melbourne', fontsize=14)
ax.set_xlabel("Longitude", fontsize=10)
ax.set_ylabel("Latitude", fontsize=10)
plt.show()
The chart shows a dense distribution of tree canopies in specific parts of the region, highlighted in green. These visualizations are critical for identifying areas that already have substantial green coverage and pinpointing regions that require further intervention to enhance urban greenery. This analysis supports strategic planning for biodiversity conservation, reduction of urban heat islands, and improving the quality of public spaces.
The visualisation below segments the geographic area into a grid by binning both latitude and longitude values, then counts how many tree canopy points fall into each grid cell. These counts are visualised as a 2D heatmap, where darker green shades represent higher tree canopy density across Melbourne.
# Bin both longitude and latitude
df_tree_canopy_2021['longitude_bins'] = pd.cut(df_tree_canopy_2021['longitude'], bins=5)
df_tree_canopy_2021['latitude_bins'] = pd.cut(df_tree_canopy_2021['latitude'], bins=5)
# Group by both bins
lat_lon_counts = df_tree_canopy_2021.groupby(['latitude_bins', 'longitude_bins']).size().unstack(fill_value=0)
# Plot heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(lat_lon_counts, cmap='Greens', annot=True, fmt='d', linewidths=0.5)
plt.title("Tree Canopy Distribution by Latitude and Longitude Bins", fontsize=14)
plt.xlabel("Longitude Range")
plt.ylabel("Latitude Range")
plt.xticks(rotation=45)
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()
The heatmap shows that the central and southeastern grid cells have the highest tree canopy density, indicating dense vegetation in those zones. In contrast, the northwest and outer edges contain significantly fewer points, suggesting lower canopy coverage or less vegetation data in those areas. This spatial insight is valuable for targeting urban greening efforts where canopy presence is sparse.
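As a follow-up, the sparsest occupied grid cells can be listed directly from the binned counts. The sketch below is a hedged, self-contained version of that idea: it uses a handful of made-up coordinates and two bins per axis, rather than the five bins and full dataset used above, purely to keep the example small.

```python
import pandas as pd

# Toy canopy points: dense in the south-east, a lone point in the north-west.
df = pd.DataFrame({
    "longitude": [144.90, 144.99, 144.98, 144.99, 144.97],
    "latitude":  [-37.78, -37.82, -37.82, -37.81, -37.82],
})

# Bin both axes into a coarse grid, mirroring the heatmap construction.
df["longitude_bins"] = pd.cut(df["longitude"], bins=2)
df["latitude_bins"] = pd.cut(df["latitude"], bins=2)

# Count points per (lat, lon) cell and list the sparsest occupied cells:
# these are the candidate areas for greening intervention.
counts = df.groupby(["latitude_bins", "longitude_bins"], observed=True).size()
sparsest = counts.sort_values().head(3)
print(sparsest)
```

On the real data, the same pattern applied to lat_lon_counts would surface the grid cells with the fewest canopy points.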
Microclimate Sensor dataset
The microclimate dataset analysis focuses on two critical aspects: air temperature variation and PM2.5 pollution levels. The process begins by adding a month column to each record, which enables the grouping of data by both month and device. Next, average values for PM2.5 and air temperature are computed for each sensor across the months. This step prepares the data to reveal seasonal trends and sensor-specific behaviour. The resulting averages help in understanding environmental patterns that are crucial for urban planning and pollution management.
The box-and-whisker format identifies both the typical range (the box) and the more extreme daily averages (the whiskers and any outliers). Months with taller boxes or higher medians have more elevated PM2.5 or air temperature levels, or wider day-to-day fluctuations. The spread of the boxes reveals how stable or variable conditions are from day to day, and outliers may point to unusual spikes or dips on specific days.
# Create a 'month' column
df_microclimate['month'] = df_microclimate['received_at'].dt.to_period('M').astype(str)
# Group by month and device to calculate the average PM2.5 and air temperature
avg_pm25 = df_microclimate.groupby(['month', 'device_id'])['pm25'].mean().reset_index()
avg_air_temp = df_microclimate.groupby(['month', 'device_id'])['airtemperature'].mean().reset_index()
# Create a box plot: x-axis as month, y-axis as average PM2.5
fig1 = px.box(avg_pm25,
              x='month',
              y='pm25',
              title="Distribution of Average PM2.5 per Month Across Devices",
              labels={"pm25": "Average PM2.5", "month": "Month"})
fig1.show()
# Create a box plot: x-axis as month, y-axis as average Air Temperature
fig2 = px.box(avg_air_temp,
              x='month',
              y='airtemperature',
              title="Distribution of Average Air Temperature per Month Across Devices",
              labels={"airtemperature": "Average Air Temperature", "month": "Month"})
fig2.show()